CONTENTS

1. CordellClayFarissWoodWright-DisaggregatingRepression.R = R code used to produce all tables, figures, and descriptive statistics in the manuscript

2. snarp_allegation_data.csv = Dataset containing all 2,013,199 sentences extracted from the SNARP corpus text files (7,445 individual text files for 196 countries from 1999 to 2016). All sentences are coded according to whether they contain information on physical integrity rights allegations.

- doc.id: File name (iso3c_year_source)                 
- iso3c: ISO 3166-1 alpha-3 country code
- ccode: Correlates of War numeric country code
- country: Country name (English)
- year: Observation year                   
- source.short: Reporting organization (first letter)           
- sentence.id: Unique sentence id in corpus            
- line.no: Line number of sentence in original text file
- original.text: Original text of sentence (before cleaning)
- clean.text: Cleaned version of original.text (corrected for errors, removed line numbers, stripped unnecessary white space, removed encoded text and non-ascii characters, converted to lower case)             
- stemmed.text: Stemmed version of clean.text (words stemmed to their base root form) 
- train.dict.text: Reduced version of stemmed.text including terms in our training dictionary only         
- common.text: Reduced version of stemmed.text including non-sparse terms only
- train.dict.common.text: Combined text of train.dict.text and common.text
- word.count: Total number of terms in a sentence   
- train.dict.word.count: Number of terms in a sentence in our training dictionary   
- train.target: Binary variable = 1 if a sentence is in our training data  
- test.india.target: Binary variable = 1 if a sentence is in our India test data      
- test.random.target: Binary variable = 1 if a sentence is in our random sample test data              
- svm.pred: Binary variable = 1 if a sentence is coded by the Support Vector Machine model as a physical integrity rights allegation               
- nb.pred: Binary variable = 1 if a sentence is coded by the Naive Bayes model as a physical integrity rights allegation
- logistic.pred: Binary variable = 1 if a sentence is coded by the Logistic Regression model as a physical integrity rights allegation
- majority.pred: Binary variable = 1 if a sentence is coded by the Majority Vote model as a physical integrity rights allegation          
- logistic.prob.0: Predicted probability from the Logistic Regression model that a sentence is not a physical integrity rights allegation (high value = predicted non-allegation)
- logistic.prob.1: Predicted probability from the Logistic Regression model that a sentence is a physical integrity rights allegation (high value = predicted allegation)
- alleg.prob: Predicted probability of the sentence being a physical integrity rights allegation (same as logistic.prob.1)
- alleg.dummy: Binary classification of the sentence being a physical integrity rights allegation (same as majority.pred)
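
The relationships among the prediction columns described above can be checked once the file is loaded. The following is a minimal sketch in Python (the replication code itself is in R). It assumes that the CSV header uses the variable names exactly as listed, that binary columns are stored as 0/1, and that the Majority Vote model codes a sentence as 1 when at least two of the three classifiers do. None of these assumptions is confirmed by this README, and the two sample rows are made up for illustration.

```python
import csv
import io


def majority_vote(svm, nb, logistic):
    """Return 1 if at least two of the three binary predictions are 1.

    This "two of three" rule is an assumption about the Majority Vote model.
    """
    return 1 if (svm + nb + logistic) >= 2 else 0


def check_row(row, tol=1e-6):
    """Return the names of the documented relationships that fail for one row."""
    problems = []
    votes = [int(row[k]) for k in ("svm.pred", "nb.pred", "logistic.pred")]
    if int(row["majority.pred"]) != majority_vote(*votes):
        problems.append("majority.pred")
    # alleg.dummy is documented as "same as majority.pred"
    if int(row["alleg.dummy"]) != int(row["majority.pred"]):
        problems.append("alleg.dummy")
    # alleg.prob is documented as "same as logistic.prob.1"
    if abs(float(row["alleg.prob"]) - float(row["logistic.prob.1"])) > tol:
        problems.append("alleg.prob")
    return problems


# Two made-up rows standing in for snarp_allegation_data.csv (illustrative only).
sample = io.StringIO(
    "sentence.id,svm.pred,nb.pred,logistic.pred,majority.pred,"
    "logistic.prob.0,logistic.prob.1,alleg.prob,alleg.dummy\n"
    "1,1,0,1,1,0.2,0.8,0.8,1\n"
    "2,0,1,0,0,0.9,0.1,0.1,0\n"
)
rows = list(csv.DictReader(sample))
failures = {r["sentence.id"]: check_row(r) for r in rows}
print(failures)
```

To run the same checks on the real file, the in-memory sample would be replaced with an open file handle for snarp_allegation_data.csv. With roughly two million rows, iterating over the reader row by row is likely preferable to loading everything into a list.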

3. test_data_dtm.csv = Document term matrix for our test data used to calculate out-of-sample model accuracy
4. training_data_dtm.csv = Document term matrix for our training data used to calculate in-sample model accuracy
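
For readers unfamiliar with the format, a document term matrix has one row per document (here, a sentence) and one column per term. The sketch below, in Python, shows how such a matrix can be built from stemmed text. The example sentences, the whitespace tokenization, and the use of raw counts as cell values are illustrative assumptions; the exact column layout of the two CSV files is not described in this README.

```python
from collections import Counter


def build_dtm(sentences):
    """Build a document-term matrix of raw term counts.

    Returns (vocabulary, matrix), where vocabulary is sorted alphabetically
    and matrix[i][j] is the count of vocabulary[j] in sentences[i].
    """
    tokenized = [s.split() for s in sentences]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical stemmed sentences, not taken from the corpus.
docs = ["polic arrest protest", "polic beat detain detain"]
vocab, dtm = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```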

5. HumanRightsProtectionScores_v4.01.Rdata = Political Terror Scale and Latent Human Rights Protection Scores used for external validation of our models. Christopher J. Fariss, Michael R. Kenwick, and Kevin Reuning. "Estimating one-sided-killings from a robust measurement model of human rights." Journal of Peace Research 57(6):801-814 (November 2020).